DICTIONARY LEARNING AND SPARSE CODING FOR UNSUPERVISED CLUSTERING


Pablo  Sprechmann  and  Guillermo  Sapiro 
University  of  Minnesota 


ABSTRACT 

A clustering framework within the sparse modeling and dictionary
learning setting is introduced in this work. Instead of searching
for the set of centroids that best fit the data, as in k-means type
approaches that model the data as distributions around discrete
points, we optimize for a set of dictionaries, one for each cluster,
for which the signals are best reconstructed in a sparse coding
manner. Thereby, we model the data as a union of learned
low-dimensional subspaces, and data points associated with subspaces
spanned by just a few atoms of the same learned dictionary are
clustered together. Using learned dictionaries makes this method robust
and well suited to handle large datasets. The proposed clustering
algorithm uses a novel measurement for the quality of the sparse
representation, inspired by the robustness of the $\ell_1$
regularization term in sparse coding. We first illustrate this
measurement with examples on standard image and speech datasets in the
supervised classification setting, showing with a simple approach its
discriminative power and obtaining results comparable to the
state-of-the-art. We then conclude with experiments for fully
unsupervised clustering on extended standard datasets and texture
images, obtaining excellent performance.

Index Terms: Clustering, sparse representations, dictionary
learning, subspace modeling, texture segmentation.

1.  INTRODUCTION 

In recent years, sparse representations have received a lot of attention
from the signal processing community. This is due in part to the fact
that an important variety of signals such as audio and natural images
can be well approximated by a linear combination of a few elements
(atoms) of some (often) redundant basis, usually called a dictionary.
See [1] and references therein for a review.

Sparse modeling aims at learning these non-parametric dictionaries
from the data itself. Several algorithms have been developed
for this task, e.g., the K-SVD and the method of optimal directions
(MOD) (see for example [2] and references therein). Recent
publications in a wide spectrum of signals and applications have shown
that this approach can be very successful, leading to state-of-the-art
results, e.g., in image restoration and denoising, texture synthesis,
and texture classification.

In the classification setting, this class of algorithms learns
dictionaries from the labeled training dataset and uses the features of
the sparse decomposition of the testing signal for classification (see
[2, 3, 4] and references therein). One contribution of our work is to
extend these classification strategies to the fully unsupervised setting
of data clustering.


Work supported by ONR, NGA, ARO, DARPA, and NSF. We thank I.
Ramirez, F. Lecumberry, and J. Mairal for very useful discussions and
fundamental software.


In this paper we propose an algorithm for clustering datasets
that are well represented in the sparse modeling framework with a
set of learned dictionaries. The main idea is that, given the number of
clusters $K$, we find the optimal $K$ dictionaries for representing the
data, and then associate each signal to the dictionary for which the
“best” sparse decomposition is obtained.^1 This is achieved by

$$\min_{D_i, C_i} \sum_{i=1}^{K} \sum_{x_j \in C_i} \mathcal{R}(x_j, D_i), \tag{1}$$

where $D_i \in \mathbb{R}^{n \times k_i}$ is the $k_i$-atom dictionary
associated with the class $C_i$, $x_j \in \mathbb{R}^n$ are the data
vectors, and $\mathcal{R}$ is a function that measures how good the
sparse decomposition of the signal $x_j$ under the dictionary $D_i$ is.
In the general case, different dictionaries may have different numbers
of atoms, i.e., $k_i$ may be cluster dependent. This problem is closely
related to the k-q-flats algorithm, which aims at finding the $k$
$q$-dimensional flats closest to a dataset [7]. However, there are
major differences between the two. In particular, the framework
proposed here, following the sparse representation approach, does not
assume a pre-defined dimension ($q$), or even one that is constant
across classes, resulting in a richer space for representing and
clustering the signals.

We propose a measurement $\mathcal{R}$ for the quality of the sparse
representation that naturally takes into account both the reconstruction
error and the sparseness (complexity) of the representation on the
corresponding learned dictionary. In practice this measurement has
shown strong discriminative power. To further show this, we performed
experiments in the supervised classification setting using labeled
data: we first learned a dictionary for each class, and then
classified each testing signal according to this measure. This very
simple approach gives results comparable with the state-of-the-art for
several benchmark datasets.

The proposed clustering algorithm minimizes (1) using a k-means
type of approach that learns a dictionary for each cluster
and refines it through the iterations. Experimentally, excellent
performance is obtained, both on standard datasets and on texture
segmentation tasks.

In the unsupervised clustering case, the initialization is very
important for the success of the algorithm. Due to the cost associated
with the procedure, repeating random initializations is practically
impossible. Thus a “smart” initialization is needed. We propose an
approach that combines sparse coding with spectral clustering [8].

Ideas similar to the ones proposed here were previously employed
for subspace clustering [9, 10], clustering using the so-called
$\ell_1$-graph by Huang and Yan (see description in [11]), and label
propagation [12]. In contrast with our proposed dictionary learning
framework, these very inspiring approaches all use the data itself as
the dictionary, sparsely representing every data point as a linear
combination of the rest of the data. Such a representation is
computationally expensive (virtually unusable for datasets of
thousands of points). In addition, the large redundancy and coherence
expected from using the data itself as the dictionary is prone to make
the sparse coding very unstable: it is well known that such coding
techniques strongly depend on the internal coherence of the dictionary.
Furthermore, the performance of these methods decreases when the number
of clusters grows. We propose as part of our framework a method to
bypass this problem that divides the clustering problem into several
binary ones. In a natural way, we use the energy function to decide
which partition to choose. Such a binary division framework is not so
natural for these other related clustering methods.

^1 Note that it is not that each data point belongs to a union of
subspaces as for example in [5, 6]. Compared with block/group sparsity,
here a single dictionary (block) is selected per data point, and the
point is sparsely represented (subspace) with atoms only from this
dictionary.

The remainder of this paper is organized as follows: In Section 2
we briefly summarize the main ideas of sparse coding and dictionary
learning. In Section 3 we define the measure $\mathcal{R}$ and analyze
its discriminative power. In Section 4 we present the proposed
clustering algorithm and in Section 5 the corresponding experimental
results for clustering and texture segmentation. Finally, we conclude
the paper in Section 6.


2.  SPARSE  CODING  AND  DICTIONARY  LEARNING 


Sparse coding means representing a signal as a linear combination of
a few atoms of a given (often overcomplete) dictionary. Mathematically,
given a signal $x \in \mathbb{R}^n$ and a dictionary
$D \in \mathbb{R}^{n \times k}$, the sparse representation problem can
be stated as

$$\min_{a} \|a\|_0 \quad \text{s.t.} \quad x = Da, \tag{2}$$

where $\|a\|_0$ is the “$\ell_0$-norm”^2 of the coefficient vector
$a \in \mathbb{R}^k$, i.e., the number of non-zero elements. This
problem is NP-hard, and is thus commonly approximated by substituting
the $\ell_1$-norm for the $\ell_0$-norm in Equation (2). In the noisy
case the equality constraint must be relaxed as well. An alternative
is then to solve a Lasso-type problem,

$$\min_{a} \|x - Da\|_2^2 + \lambda \|a\|_1, \tag{3}$$

where $\lambda$ is a parameter that balances the tradeoff between
reconstruction error and sparsity. It is a well known fact that in
general the $\ell_1$ constraint induces sparse solutions for the
coefficient vectors $a$. Furthermore, this is a convex problem that can
be solved very efficiently using for example the LARS-Lasso algorithm
[13]. This alternative has also been shown to be more stable than the
$\ell_0$ approach in the sense that in the latter, small variations in
the input signal can produce very different active sets (the set of
non-zero coefficients in $a$, or selected atoms from $D$).
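
As a concrete illustration of the Lasso-type problem (3), the following
minimal Python sketch solves it with iterative soft-thresholding (ISTA).
This is not the LARS-Lasso solver [13] mentioned above, only a simple
substitute that targets the same objective. The function names
`soft_threshold` and `lasso_ista` and the iteration count are assumptions
made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(x, D, lam, n_iter=500):
    """Approximate argmin_a ||x - D a||_2^2 + lam * ||a||_1 (Equation (3))."""
    # Lipschitz constant of the gradient 2 D^T (D a - x) of the quadratic term.
    lip = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x)
        a = soft_threshold(a - grad / lip, lam / lip)
    return a

# Toy usage: with an orthonormal dictionary the solution is the
# soft-thresholding of D^T x at level lam / 2.
D = np.eye(3)
x = np.array([1.0, 0.2, -0.5])
a = lasso_ista(x, D, lam=0.6)
```

In this toy case the small middle coefficient is set exactly to zero, which
is the sparsity-inducing effect of the $\ell_1$ term discussed above.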

^2 Although this is normally referred to as a norm counting the
non-zero elements of a vector, it is actually a pseudo-norm.

Dataset    Proposed   A      B      C      SVM    k-NN
MNIST      1.26       3.41   1.05   -      1.4    5.0
USPS       4.14       3.56   4.38   6.05   4.2    5.2
ISOLET     3.27       4.3    3.4    -      3.3    8.7

Table 1. Error rate (in percentage) for the classification algorithm
discussed in Section 3. We present comparisons with recently published
approaches. MNIST: (A) is the best reconstructive method presented in
[16], while (B) is the best discriminative one. USPS: (A) is the best
reconstructive and (B) the best discriminative method reported in [16];
(C) is the best result obtained in [17]. ISOLET: (A) is the supervised
k-q-flats and (B) is the k-metrics in [18]. We also compare with an SVM
with Gaussian kernel and the Euclidean k-NN.

Now, what about the actual dictionary $D$? State-of-the-art results
have shown that it should in general be learned from data.
Given a set of signals $\{x_i\}_{i=1,\dots,m}$ in $\mathbb{R}^n$, the
goal is to find a dictionary $D \in \mathbb{R}^{n \times k}$ such that
each signal in the set can be represented as a sparse linear
combination of its atoms. In this work we use a variation of MOD
following [14]. The algorithm learns the dictionary by solving the
following optimization problem:

$$\min_{D, \{a_i\}} \sum_{i=1}^{m} \|x_i - D a_i\|_2^2 + \lambda \|a_i\|_1, \tag{4}$$

restricting the atoms to have unit Euclidean norm. The optimization
is carried out using an iterative approach that is composed of two
(convex) steps: the sparse coding step on a fixed $D$ and the
dictionary update step with a fixed active set.
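
The two-step alternation for problem (4) can be sketched as follows. This
is a simplified illustration, not the exact variation of MOD from [14]: the
sparse coding step uses ISTA instead of LARS-Lasso, and the dictionary
update is a plain least-squares (MOD-style) fit followed by renormalizing
the atoms, rather than an update restricted to a fixed active set. The
names `learn_dictionary`, `n_outer` and `seed` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(x, D, lam, n_iter=200):
    """Approximate solver for Equation (3) (ISTA, not LARS-Lasso)."""
    lip = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - 2.0 * D.T @ (D @ a - x) / lip, lam / lip)
    return a

def learn_dictionary(X, k, lam, n_outer=10, seed=0):
    """Alternate sparse coding and a MOD-style least-squares atom update.

    X holds one signal per column. Atoms are initialized with randomly
    chosen (assumed nonzero) signals and kept at unit Euclidean norm.
    """
    rng = np.random.default_rng(seed)
    D = X[:, rng.choice(X.shape[1], size=k, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_outer):
        # Sparse coding step: D fixed, one Lasso problem per signal.
        A = np.column_stack([lasso_ista(x, D, lam) for x in X.T])
        # Dictionary update step: least-squares fit of D with the codes fixed.
        D_new = X @ A.T @ np.linalg.pinv(A @ A.T)
        norms = np.linalg.norm(D_new, axis=0)
        used = norms > 1e-10  # keep the previous atom when it was never used
        D[:, used] = D_new[:, used] / norms[used]
    return D

# Toy usage: signals on a single line in R^2 are captured by one atom.
d_true = np.array([0.6, 0.8])
X_toy = np.outer(d_true, [1.0, -2.0, 3.0, 0.5])
D_toy = learn_dictionary(X_toy, k=1, lam=0.1)
```

Renormalizing after the least-squares step is a common heuristic for
enforcing the unit-norm constraint; it is an assumption of this sketch and
does not guarantee a monotone decrease of the objective.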

3.  THE  SPARSE  REPRESENTATION  QUALITY 

A common approach when using dictionaries for classification is to
train class-specific dictionaries using labeled data and then assign
each testing signal to the class for which the best reconstruction is
obtained [2, 4]. The measure employed for this task is often the
reconstruction error, $\mathcal{R}(x, D) = \|x - Da\|_2^2$, where $a$
is the optimal coefficient vector in the sparse coding. While this
strategy leads to very good results, it does not take into account the
actual sparsity of the reconstruction. Suppose that we have two
dictionaries for which almost the same reconstruction error is
obtained, but one of them requires twice as many atoms as the other.
In such a situation one would rather select the dictionary that gives
the sparsest solution (simplest following Akaike’s Information
Principle [15]), even if the reconstruction error is slightly bigger.

In practice, this problem can be addressed using a small pre-defined
sparsity level $L$ in an $\ell_0$ approach. This strategy is no longer
valid when the convex relaxation of Equation (3) is employed. In this
situation comparing the reconstruction errors alone has little
meaning. We propose then to use the actual cost function in the Lasso
problem as a measure of performance, as used in the dictionary
learning (4), $\mathcal{R}(x, D) = \|x - Da\|_2^2 + \lambda \|a\|_1$,
where as before $a$ is the optimal coefficient vector. This alternative
naturally takes into account both the reconstruction error and the
complexity of the sparse decomposition. The reconstruction error
measures the quality of the approximation while the complexity is
measured by the $\ell_1$ norm of the optimal $a$.

Let $X_i$, $i = 1, \dots, K$, be a collection of $K$ (labeled) classes
of signals and let $D_i$ be the corresponding dictionaries trained for
each of them independently following for example (4).^3 The class
$j_0$ for a given new signal $x$ is found by solving

$$j_0 = \arg\min_{j = 1, \dots, K} \mathcal{R}(x, D_j).$$

This procedure is very simple and has only two parameters: the
penalty parameter $\lambda$ and the size of the dictionaries $k$. Both
can be selected via cross-validation.
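
The classification rule above can be sketched in a few lines of Python.
The sketch assumes the class dictionaries have already been learned and
passes them in as a list; it again uses ISTA as a stand-in Lasso solver.
The names `quality` and `classify` are hypothetical, and class indices are
0-based.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(x, D, lam, n_iter=200):
    """Approximate solver for Equation (3) (ISTA, not LARS-Lasso)."""
    lip = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - 2.0 * D.T @ (D @ a - x) / lip, lam / lip)
    return a

def quality(x, D, lam):
    """R(x, D) = ||x - D a||_2^2 + lam * ||a||_1 at the approximate optimum a."""
    a = lasso_ista(x, D, lam)
    return float(np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a)))

def classify(x, dictionaries, lam):
    """Return j0 = argmin_j R(x, D_j) as a 0-based class index."""
    return int(np.argmin([quality(x, D, lam) for D in dictionaries]))

# Toy usage: two one-atom dictionaries spanning the two axes of R^2.
D_a = np.array([[1.0], [0.0]])
D_b = np.array([[0.0], [1.0]])
label_1 = classify(np.array([2.0, 0.1]), [D_a, D_b], lam=0.1)
label_2 = classify(np.array([0.1, -3.0]), [D_a, D_b], lam=0.1)
```

Because `quality` includes the $\ell_1$ term, two dictionaries with similar
reconstruction error are ranked by the complexity of their codes, which is
the behavior motivated in this section.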

As a way to evaluate the discriminative power of the measure
just introduced, which will also be used for the proposed unsupervised
clustering approach, we test this simple classification method
on standard datasets: the MNIST and USPS digit datasets and the
ISOLET data, which consists of 617 audio features extracted from 200
speakers saying each letter of the alphabet twice. We used in every
case the usual training/testing split. Table 1 presents the obtained
results. We compare our results with several much more sophisticated
classification algorithms; the results obtained are comparable and
sometimes even better. We also compare with the standard Euclidean
k-NN and with an SVM with a Gaussian kernel. In all our experiments we
used a penalty parameter $\lambda = 0.1$. The size of the dictionary
depends on the number of training samples as well as the intrinsic
complexity of the data. For MNIST we report results for a dictionary
with $k = 800$, $k = 300$ for the USPS digit dataset, and $k = 100$
for ISOLET. In the last case the training sample is very small, making
it impossible to choose larger dictionaries.

^3 See also [2] for cross-training.

One could think of using the whole training datasets as dictionaries
for each class, as with the approaches mentioned in the introduction
[9, 10, 11]. In that case, in all our experiments the error rates
obtained are not better than the ones reported in Table 1. Using the
data as dictionaries has the disadvantage that the computational cost
of the classification is prohibitive,^4 and the method is highly
susceptible to label errors due to the high coherence of the
“dictionary.”

4.  DICTIONARY  LEARNING  FOR  CLUSTERING 

We now proceed to present the main contribution of this paper,
namely, extending the dictionary learning and sparse coding
frameworks to unsupervised clustering. Given a set of signals
$\{x_j\}_{j=1,\dots,m}$ in $\mathbb{R}^n$ and the number of
clusters/classes $K$,^5 we want to find the set of $K$ dictionaries
$D_i \in \mathbb{R}^{n \times k_i}$, $i = 1, \dots, K$, that best
represents the data. We formulate this as an energy minimization
problem of the form of Equation (1), and use the measure proposed in
Section 3,

$$\min_{D_i, C_i} \sum_{i=1}^{K} \sum_{x_j \in C_i} \min_{a_{ij}} \|x_j - D_i a_{ij}\|_2^2 + \lambda \|a_{ij}\|_1, \tag{5}$$

where, as before, the atoms of all the dictionaries are restricted to
have unit norm. The optimization is carried out iteratively using
a Lloyd’s-type algorithm that solves one problem at a time:

Assignment step: The dictionaries are fixed and each signal is assigned
to the cluster for which the best representation is obtained:
$C_{j_0} := \{x : \mathcal{R}(x, D_{j_0}) \le \mathcal{R}(x, D_i), \ \forall i = 1, \dots, K\}$.

Update step: The new dictionaries are computed fixing the assignments
found in the previous step. This is the dictionary learning problem (4).

The algorithm stops when the relative change in the energy is
less than a given constant. In practice, few iterations are needed to
reach good results. While the energy is being reduced at every step,
there is no guarantee of arriving at a global minimum. In this setting,
repeated initializations are computationally very expensive, thus a
good initialization is required. This is explained next.
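
A minimal sketch of this Lloyd's-type iteration is given below, assuming
an initial set of dictionaries is supplied. It reuses the simplifications
of the earlier sketches (ISTA for sparse coding, a renormalized
least-squares dictionary update), so it illustrates the structure of the
algorithm rather than reproducing the implementation used in the paper.
All function names, `tol` and `max_iter` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(x, D, lam, n_iter=200):
    lip = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - 2.0 * D.T @ (D @ a - x) / lip, lam / lip)
    return a

def encode(X, D, lam):
    return np.column_stack([lasso_ista(x, D, lam) for x in X.T])

def energy_per_signal(X, D, lam):
    """R(x_j, D) for every column x_j of X."""
    A = encode(X, D, lam)
    return np.sum((X - D @ A) ** 2, axis=0) + lam * np.sum(np.abs(A), axis=0)

def update_dictionary(X, D, lam, n_outer=5):
    """A few rounds of coding + least-squares update with unit-norm atoms."""
    D = D.copy()
    for _ in range(n_outer):
        A = encode(X, D, lam)
        D_new = X @ A.T @ np.linalg.pinv(A @ A.T)
        norms = np.linalg.norm(D_new, axis=0)
        used = norms > 1e-10
        D[:, used] = D_new[:, used] / norms[used]
    return D

def dictionary_clustering(X, dictionaries, lam, tol=1e-3, max_iter=20):
    dictionaries = [np.array(D, dtype=float) for D in dictionaries]
    prev_energy = np.inf
    for _ in range(max_iter):
        # Assignment step: each signal goes to the dictionary with smallest R.
        costs = np.stack([energy_per_signal(X, D, lam) for D in dictionaries])
        labels = costs.argmin(axis=0)
        energy = costs.min(axis=0).sum()
        # Stop when the relative change in the energy is below tol.
        if prev_energy - energy <= tol * energy:
            break
        prev_energy = energy
        # Update step: relearn each dictionary from its assigned signals.
        for i, D in enumerate(dictionaries):
            members = X[:, labels == i]
            if members.shape[1] > 0:
                dictionaries[i] = update_dictionary(members, D, lam)
    return labels, dictionaries, energy

# Toy usage: points near the two coordinate axes of R^2.
X_toy = np.array([[1.0, 2.0, -1.5, 0.0, 0.1, 0.0],
                  [0.0, 0.1, 0.0, 1.0, -2.0, 1.5]])
init = [np.array([[1.0], [0.2]]) / np.sqrt(1.04),
        np.array([[0.2], [1.0]]) / np.sqrt(1.04)]
labels, dicts, energy = dictionary_clustering(X_toy, init, lam=0.1)
```

Empty clusters simply keep their previous dictionary in this sketch; how
they are handled in practice is a design choice not specified in the text.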

4.1.  Initial  clusterization 

The initialization for the algorithm presented in the previous section
can be given as a set of $K$ dictionaries or as an initial partition of
the data, that is, the sets $C_i$. We propose two closely related
algorithms, one corresponding to each of these two alternatives. In
both cases the main idea is to construct a similarity matrix and use it
as the input for a spectral clustering algorithm.

^4 With our method, there is the cost of learning the dictionaries, but
this is only performed once off-line before the classification.

^5 When $K$ is over-estimated, a micro-detailed partition is observed.

Let $D_0 \in \mathbb{R}^{n \times k_0}$ be an initial dictionary, e.g.,
trained to reconstruct the data for the whole (unlabeled) set
$X := [x_1, \dots, x_m]$. For each signal $x_j$ we have the
corresponding sparse representation $a_j$; let us define
$A = [a_1, \dots, a_m] \in \mathbb{R}^{k_0 \times m}$. Signals belonging
to the same cluster are expected to have decompositions that use
similar atoms. Thus one can measure the similarity of two signals by
comparing the corresponding sparse representations. Conversely, the
similarity of two atoms can be determined by comparing how many
signals use them simultaneously, and how they contribute, in their
sparse decomposition. We compute two matrices representing each
one of these cases respectively:

Clustering the signals: Construct a similarity matrix
$S_1 \in \mathbb{R}^{m \times m}$, $S_1 := |A|^T |A|$.

Clustering the atoms: Construct a similarity matrix
$S_2 \in \mathbb{R}^{k_0 \times k_0}$, $S_2 := |A| |A|^T$.

In both cases the similarity matrix obtained is positive semidefinite
and can be associated with a graph, $G_1 := \{X, S_1\}$ and
$G_2 := \{D, S_2\}$, where the data or the atoms are the sets of
vertices with the corresponding $S_i$ as edge-weight matrices. This
graph is partitioned using standard spectral clustering algorithms to
obtain the initialization for the algorithm described in the previous
section.
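
The construction of the two similarity matrices, followed by a spectral
partition, can be sketched as below. For brevity the sketch performs only
a two-way cut using the sign of the second eigenvector of the normalized
graph Laplacian, and it assumes the graph is connected; this is a
simplification and not the specific spectral clustering algorithm of [8].
The names `similarity_matrices` and `spectral_bipartition` and the toy
code matrix are made up for illustration.

```python
import numpy as np

def similarity_matrices(A):
    """Return S1 = |A|^T |A| (signals) and S2 = |A| |A|^T (atoms)."""
    abs_A = np.abs(A)
    return abs_A.T @ abs_A, abs_A @ abs_A.T

def spectral_bipartition(S):
    """Two-way cut from the normalized Laplacian I - D^{-1/2} S D^{-1/2}."""
    deg = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]               # second-smallest eigenvector
    return (fiedler > 0).astype(int)   # which side gets label 1 is arbitrary

# Toy code matrix: 4 atoms (rows) and 6 signals (columns). Signals 0-2
# mostly use atoms 0-1, signals 3-5 mostly use atoms 2-3.
A_toy = np.array([[1.00,  0.8, 0.90, 0.05,  0.00,  0.0],
                  [0.50, -0.7, 0.60, 0.00,  0.05,  0.0],
                  [0.05,  0.0, 0.00, 1.00, -0.90,  0.7],
                  [0.00,  0.0, 0.05, 0.60,  0.80, -1.0]])
S1, S2 = similarity_matrices(A_toy)
signal_labels = spectral_bipartition(S1)  # initial partition of the data
atom_labels = spectral_bipartition(S2)    # initial split of the atoms
```

A partition of the atoms directly yields initial sub-dictionaries, while a
partition of the signals yields initial sets $C_i$ from which dictionaries
can be learned.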

As we mentioned before, $G_1$ is closely related to the $\ell_1$-graph.
In that case, the weights of the graph are determined using the sparse
decomposition of the signals with the data itself as a dictionary.
When the number of signals $m$ is large, the computational cost of
constructing the similarity matrix is too high. Also, the spectral
clustering algorithm requires the computation of the largest singular
values (and corresponding singular vectors), which is also
computationally demanding when $m$ is large (although not so demanding
if only a few eigenvectors are needed). In the case of $G_2$,
clustering the atoms bypasses these difficulties: the size of $S_2$
depends only on the significantly smaller size of the initial
dictionary, $k_0$. This parameter does not depend on the amount of
data; it just needs to be large enough to model it properly, and is
often just in the hundreds. Note that the obtained sub-dictionaries may
have different cardinalities (different $k_i$), reflecting different
complexities of the associated clusters.

When the number of clusters, $K$, is large, the performance of the
initial clusterization decreases. We therefore propose a more robust
initialization. Starting with the whole set as the only partition, at
each iteration we subdivide each of the current partitions into two
sets, keeping the division that produces the biggest decrease in the
cost energy defined in Equation (5). The procedure stops when the
desired number of clusters is reached. This can be applied to either of
the two graphs presented in this section, and such a partition is
consistent with the energy driving the clustering.

Finally, let us make an important observation. Consider the ideal
situation in which every signal in the $K$ clusters can be exactly
reconstructed as a sparse linear combination of the atoms of a
dictionary, and in which the subspaces that they span (using all the
atoms) are independent. Assume that the initial dictionary is composed
of $K$ (redundant) sub-dictionaries, $D_0 = [D_1, \dots, D_K]$, one
corresponding to each cluster in the dataset. Then, given a signal $x$
belonging to one of them, it is easy to show that the optimal $a$ in
the $\ell_1$-relaxation of problem (2) with this $D_0$ will use only
atoms from the correct block of the initial dictionary, producing $K$
connected components in both graphs $G_1$ and $G_2$. In that situation
a spectral clustering technique will successfully separate the
clusters. A proof of a similar result is presented in [10].

5.  CLUSTERING  RESULTS 

We now apply the proposed clustering algorithm to several clustering
problems and texture segmentation. We clustered the digits
from 0 to 4 ($K = 5$) using the testing set of MNIST and the training
set of USPS. We also clustered the last six letters of ISOLET,
$K = 6$, combining the standard training and testing sets.

For USPS and MNIST, we used an initial dictionary of $k_0 = 500$
atoms, and $k_0 = 560$ (80 × 7) for ISOLET. We used $G_1$ for
initialization, using during the iterations dictionaries of 200 and 100
atoms respectively. In all the cases it was easy to identify the
clusters with one of the classes. For MNIST we had a misclassification
rate of 1.44%, for USPS we obtained 1.6% misclassification (and
7.2% for digits 0-8), and 13% for ISOLET. In the last case most
errors were confusions between the letters U-W and T-Z, which
have very similar sounds. Note that these results are overall not far
from those obtained with supervised learning and classification.

Fig. 1. Results obtained for texture segmentation using the proposed
algorithm. The images (a)-(c) are mosaics from the Brodatz database.
The obtained results are shown in figures (d)-(f), having 1.74%, 0.25%
and 4.25% of misclassified pixels respectively (such misclassifications
appear at the region boundaries, where the corresponding patches
include class mixtures). In images (g) and (h) we show selected atoms
of the final sub-dictionaries obtained for image (a). The texture in
the circle required $k_1 = 82$ atoms, while the other one received
$k_2 = 118$, which goes along with the intuition of larger complexity
for this texture. The dictionary learned in the initialization had
$K \times 100$ atoms, where again $K$ is the number of textures
(clusters) in the image.

Finally, we use our clustering algorithm for the texture segmentation
problem. The approach is related to the one used in [4] for
the supervised case. Overlapping 16 × 16 patches were extracted
from the original images and used as input signals to our algorithm.
Since the borders in the mosaic images are soft, before each iteration
(thus, before recomputing the dictionaries) we applied a Gaussian
filter to smooth the segmented regions. Figure 1 shows some
of the results. The number of patches extracted was on the order of
several thousands, so the initialization with $G_2$ was applied. The
algorithm gave sub-dictionaries whose cardinality intuitively
reflects the complexity of the corresponding texture (in other words,
$k_i$ was not constant). We obtained very low rates of misclustered
pixels; for example, in image (b) we obtained 0.25%, which is better
than the 0.37% obtained in [2] for the supervised case (which was, as
far as we know, the best reported result in the literature for that
image).

We observed that the best results are obtained for all the experiments
when the initial dictionaries in the learning stage are constructed by
randomly selecting signals from the training set. If the size of the
dictionary is small compared to the dimension of the data, it is better
to first partition the dataset (using for example Euclidean k-means)
in order to obtain a more representative sample.

6.  CONCLUDING  REMARKS 

A framework for unsupervised clustering based on dictionary learning
and sparse representations was introduced in this paper. The
basic idea is to simultaneously learn a set of dictionaries that
optimally represent each one of the clusters. Toward this goal, we
introduced a new measurement of representation quality and an
initialization procedure that combines sparse coding, dictionary
learning and spectral clustering. While here we concentrated on hard
clustering, soft clustering can be obtained as well in this framework.

We  are  currently  pursuing  this  work  in  a  number  of  directions, 
including  the  incorporation  of  group  incoherence  terms  [5,  19]. 